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BACKWARD AND FORWARD MARKOV CHAIN MONTE CARLO 1 

By Radu V. Craiu and Xiao-Li Meng 

University of Toronto and Harvard University 

Antithetic coupling is a general stratification strategy for reduc- 
ing Monte Carlo variance without increasing the simulation size. The 
use of the antithetic principle in the Monte Carlo literature typically 
employs two strata via antithetic quantile coupling. We demonstrate 
here that further stratification, obtained by using k > 2 (e.g., k = 3- 
10) antithetically coupled variates, can offer substantial additional 
gain in Monte Carlo efficiency, in terms of both variance and bias. 
The reason for reduced bias is that antithetically coupled chains can 
provide a more dispersed search of the state space than multiple in- 
dependent chains. The emerging area of perfect simulation provides 
a perfect setting for implementing the fc-process parallel antithetic 
coupling for MCMC because, without antithetic coupling, this class 
of methods delivers genuine independent draws. Furthermore, anti- 
thetic backward coupling provides a very convenient theoretical tool 
for investigating antithetic forward coupling. However, the genera- 
tion of k > 2 antithetic variates that are negatively associated, that is, 
they preserve negative correlation under monotone transformations, 
and extremely antithetic, that is, they are as negatively correlated 
as possible, is more complicated compared to the case with k = 2. 
In this paper, we establish a theoretical framework for investigating 
such issues. Among the generating methods that we compare, Latin 
hypercube sampling and its iterative extension appear to be general- 
purpose choices, making another direct link between Monte Carlo 
and quasi Monte Carlo. 

1. Paired antithetic coupling. Monte Carlo estimation of the expecta- 
tion of an estimand function / with respect to a probability measure it, 
fi = J f(x)ir(dx), can be done in many ways. The simplest method, known 
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as "crude Monte Carlo," proceeds by simulating a (not necessarily inde- 
pendent) sample Xi, X2, ■ ■ ■ , X n from it and estimating fj, by the sample 
average jl n = f(Xi)- More refined methods, collectively known as swin- 
dles, take advantage of well-known statistical principles to construct more 
efficient designs and/or more efficient estimators. In this paper we focus on 
antithetic coupling, which can be viewed as a way to induce stratification. 
For other type of swindles see Hammersley and Handscomb (1965), Simon 
(1976), Kennedy and Gentle (1980) and Gentle (1998), among others. 

In the context of classic independent sampling, antithetic coupling [Hammersley 
and Morton (1956)] is commonly described as a method of producing two 
negatively correlated copies of an unbiased estimator. The average of the 
two, each based on m draws, will then have smaller variance compared to 
the estimator based on the n = 2m independent draws. Substantial reduc- 
tions have been documented throughout the literature from early works, such 
as Page (1965) and Fishman (1972), to most recent ones such as Frigessi, 
Gasemyr and Rue (2000), who demonstrate the power of the antithetic prin- 
ciple in the more complicated (forward) MCMC setting. 

The negative correlation between the two copies is typically induced 
via antithetic quantile coupling by using a pair {U,l — U}, where U ~ 
Uniform(0, 1), in the sample generating process. The amount of variance 
reduction is governed by the degree of symmetry in the distribution of our 
estimator. In the extreme case when this distribution is symmetric, the use 
of paired antithetic variates can entirely eliminate the Monte Carlo variance 
when the underlying draws are independent or reduce the variance from 
the usual n" 1 rate to n~ 2 rate when the draws are realizations of a gen- 
uine MCMC algorithm, as observed by Frigessi, Gasemyr and Rue (2000). 
This emphasizes the stratification aspect of the antithetic principle as a way 
to divide the sample space into a "negative" stratum and a "positive" stra- 
tum. The equal amount of draws from each stratum ensured through pairing 
brings in further variance reduction compared to using simple random sam- 
pling within each stratum. Generalizing the antithetic coupling from k = 2 
to k > 2 processes is quite natural from this stratification point of view, as 
often more than two strata can produce substantial additional gain. A main 
goal of this paper is to demonstrate such gains in the context of MCMC, as 
well as the additional benefit of improving the mixing of the original Markov 
chains. 

The paired quantile coupling via {U, 1 — U} has the following extreme 
antithesis (EA) property that is usually not emphasized in the literature. 
Specifically, if F is an arbitrary univariate cumulative distribution function 
(CDF), and X x = F _1 (17), X 2 = - U), where U ~ Uniform(0, 1), then 

Corr(Ai, X2) achieves the minimal possible value subject to the constraint 
that X\,X2 ~ F. The proof [Moran (1967)] relies on the elegant Hoeffding 
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identity 

(1.1) Cav(X u X 2 ) = J J [F(xi,x 2 ) -F(x 1 )F(x 2 )]ti(dx 1 dx 2 ), 

and the equally elegant Frechet (1951) inequality 

(1.2) max{F( Xl ) + F(x 2 ) - 1,0} < F( Xl ,x 2 ) < mm{F(xi),F(x 2 )}, 

where F(xi,x 2 ) is the joint CDF. The fact that the single strategy of us- 
ing {U, 1 — U} achieves EA simultaneously for all F also implies that it 
achieves EA for Con(g(Xi) , g(X 2 )) , where g is any monotone function such 
that J g 2 (x)F(dx) < oo. It is known that negative correlation is not even 
preserved by monotone transformations, so the above discussion implies 
that {U, 1 — U} must satisfy a stronger condition, negative association (NA) 
[Joag-Dev and Proschan (1983)], which requires exactly that the negative 
correlation be preserved by monotone functions of the variates (see Sec- 
tion 3.1). The fact that {U, 1 — U} is automatically NA appears to be re- 
sponsible for the general silence of the NA requirement in the literature of 
antithetic coupling, but once we move beyond k = 2, NA (as well as EA) 
becomes a key notion in our general theoretical foundation. 

The rest of the paper is divided into three sections. Section 2 presents 
the general ^-process parallel antithetic coupling technique, illustrated with 
a backward MCMC mixture sampling and two forward MCMC sampling 
schemes. However, for k > 2, as we demonstrate in Section 3, there is no 
general strategy that achieves EA simultaneously even just for uniform and 
normal distributions. As a result, it is harder to ensure NA and EA for a 
given problem, especially in theory. Section 3 is thus devoted to a general 
theory for ensuring NA and EA for arbitrary k, which is our main theoretical 
contribution. Section 4 discusses several common methods for generating k 
antithetic variates and their general properties with respect to achieving EA 
and NA. 

2. fc-process parallel antithetic coupling and illustration. 

2.1. Perfect simulation and time-backward dual sequence. Since the sem- 
inal work by Propp and Wilson (1996) on coupling from the past (CFTP), 
there has been an array of methodological and theoretical papers on how to 
use backward coupling methods for exact MCMC sampling. David Wilson's 
web site (http://dimacs.rutgers.edu/~dbwilson/exact) is the most compre- 
hensive single source for learning about the fast-growing field of exact sam- 
pling or perfect simulation, so named because, for this class of sampling 
methods, the thorny issue of deciding the running time for an acceptable er- 
ror in approximating the distribution of interest, 7r, completely vanishes. A 
CFTP algorithm, or Fill's algorithm (1998), or many of their variations and 



4 



R. V. CRAIU AND X.-L. MENG 



generalizations [e.g., Fill, Machida, Murdoch and Rosenthal (2000), Meng 
(2000) and Wilson (2000b)] will terminate itself with probability 1 in a fi- 
nite amount of time and yet delivers authentic (and hence exact/perfect) 
independent draws from the limiting distribution. 

This can be achieved by constructing, for a Markov chain {Xt}t>o with 
stationary distribution ir, the "dual" sequence Xt such that it has the same 
distribution as Xt but it purposely violates the Markovian property. This is 
perhaps most easy to illustrate by first considering a time-inhomogeneous 
Markov chain defined by Xt = iftt(Xt-i, Ut), where {Ut, t € AT} are i.i.d. ran- 
dom variables. For any t > and any starting value X* , we can compute a 
time-forward sequence 

(2.1) x t = Mi>t-i(- ■ -MX*, i/i) • • • , u t -!), u t ), 

as well as a time-backward sequence 

(2.2) x t = MM--- Mx*,u t ) ...,u 2 ), Ui). 

Clearly, X t is Markovian because X t = ipt(X t -i, Ut), but X t is not. 

Evidently, when ipt is time- homogeneous, Xt and Xt have identical dis- 
tributions for any t. What is gained by giving up the Markovian property is 
that the backward sequence Xt can "hit and stay at" the limit [Thorisson 
(2000), Chapter 1] in a finite amount of time when its forward counterpart Xt 
cannot. This is easiest to see in an extreme case where ip(X, U) = U, which, 
say, is uniform on (0, 1). Then it is obvious that Xt = Ut, but Xt = U\, 
for all t > 1. In other words, while both Xt and Xt converge, trivially, to 
Uniform(0, 1) in distribution, Xt wanders off from Ut to Ut+\, even if it hits 
the right limiting distribution already at t=l. In contrast, Xt stays at the 
exact same value U\ for all t > 1. In general, assuming Xt is uniformly er- 
godic, one can conclude that while X t converges in distribution, there exist 
an updating tp and a finite stopping time T such that Xt hits the same limit 
with probability 1 [Foss and Tweedie (1998)]. By mapping t to —t in (2.2), 
and thereby creating a convenient forward execution of the time-backward 
sequence given in (2.2), Propp and Wilson (1996) devise the CFTP method 
for finding such a T. 

At the crux of CFTP implementation is the ability to follow a large, pos- 
sibly infinite, number of paths in time and to assess whether these paths 
have all merged after a certain time T. For a good introduction to CFTP 
we refer to Casella, Lavine and Robert (2001). In real applications, the most 
challenging problem is that for many routine statistical problems, such as 
posterior sampling, the state space is both uncountable and unbounded. Al- 
though intense work has been done in this area, in terms of both general 
strategies and specific implementation [e.g., Kendall (1998), Murdoch and 
Green (1998), M0ller and Nicholls (1999), M0ller (1999), Murdoch (2000), 
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Hobert, Robert and Titterington (1999), Casella, Mengersen, Robert and 
Titterington (2002) and Philippe and Robert (2003)], the difficulties exhib- 
ited in Murdoch and Meng (2001) in the context of posterior sampling with 
a t likelihood and normal mixture priors indicate that much more research is 
needed before CFTP and its variants and extensions can become a method 
of choice in common statistical applications. 

2.2. Multiprocess antithetically coupled CFTP. Since CFTP delivers i.i.d. 
draws from the target density, when viewed as a "black box," it is no different 
from many classical Monte Carlo sampling methods, such as inverse CDF 
transform or rejection sampling. It is then natural to consider antithetic 
coupling for CFTP. However, unlike classical methods, the CFTP black box 
is a mapping from an infinite product space for {Ut, t < 0} to the state space 
S, which makes the underlying theory for guaranteeing NA (and EA) of the 
samples more complicated than those available in the literature. A theoreti- 
cal foundation is therefore needed, and this is established in Section 3. Here 
we focus on the description of the method itself. 

At the core of our method is the generation of k negatively associated 
{U^\ . . . , £A fe )}, each having the same marginal distribution as the U re- 
quired by U). There are many ways for doing so for a given distribution 
of U\ see Section 4. Given such a method, one can run k CFTP processes in 
parallel, the jth one using {U^ ,t < 0}, j = 1, . . . , k, where {U^,. . . , U^}, 
t < 0, are i.i.d. copies of {U<U,...,UW}, as sketched in Figure 1. Within 
the jth process of CFTP all paths are positively coupled because they use 
the same {U^\t < 0}. At each update, {u[ l \ . . . , f7 t } are NA, a property 
that clearly does not alter the validity of each individual CFTP process. 

To obtain n = km draws, we repeat the above procedure independently 
m times and collect {X- , 1 < i < m; 1 < j < k}, where i indexes the repli- 
cation, as our sample {X\, X2, ■ ■ ■ , X n }. Let a"j = Var 7r [/(A)] and p k = 
Corr 7r (/(Xj ), /(xj 2 ^)), which is intended to be negative. Then 



Consequently, the variance reduction factor (VRF) relative to the indepen- 
dent sampling with the same simulation size, is 



We emphasize here the dependence of Sj: on k and more importantly on 
/, and thus the actual gain in reduction can be of practical importance for 
some / but not for others. 



(2.3) 




(2.4) 



SW = l + (fc-l)pU 
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( f) 

The size- fixed VRF S y k JJ ignores the possible different computational re- 
quirements between generating k independent draws and k antithetically 
coupled draws. A completely fair comparison is typically impossible with 
any simulation study because the overall computational time is seldom 
exactly linear in size and it can depend critically on software, hardware, 
programming skill and the actual implementation. Nevertheless, a useful 
approximation can be derived by assuming linearity and by ignoring any 
"overhead." Specifically, let be the average CPU time needed to make 
a joint draw from a particular antithetic A;-process. Then, under our as- 
sumptions, for a given total CPU time T, the average number of dependent 
draws we can make is nd ep = Tk/rk, and the number of independent draws 
is n inc j C p = T/t\. Consequently, the time-fixed VRF is given by 



(2.5) 



T, 



if) 



CkS[ f) 



where Ck 



n 



Note Ck is free of /, but a large Ck may offset the gain in S k (i.e., by 
making T fc u; > 1), as seen in the next section. Also note that although we 
use the t\ notation for consistency, there can be a substantial difference be- 
tween setting k = 1 in a general fe-process subroutine and using a specifically 
designed subroutine for making independent draws; the latter was used in 
all our examples. 



2.3. An illustration with a mixture CFTP. Hobert, Robert and Titter- 
ington [HRT (1999)] considered the mixture model pfo(x) + (1 — p)fi(x) 
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Parallel antithetic backward CFTP processes. 
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where 0<p<l is the estimand. Given i.i.d. observations {x±, X2, ■ ■ ■ , x n } 
from this model, and a flat prior on p, the posterior for p is proportional to 
Yli=i[pfo(xi) + (1 — p) fi{xi)}. To deal with the continuous distribution of p, 
HRT use the natural discrete data augmentation that comes with the latent 
mixture indicator {z±,Z2, • • • , z n }, where Z{ = if X{ is from /o and Z{ = 1 
if Xi is from f\. Starting from T = 1, the following steps define the HRT 
algorithm: 

1. Start a "bottom chain" at p_ T = and a "top chain" at p^ T = 1. 

2. At each — T <t< —1 and for j = 1,2: generate n i.i.d. uniform r.v.'s 

{uti,u t 2, utn} = Ut, and then set z%} = if u ti < U) E trW^Kl < ^ 
(7) 

and z.V- = 1 otherwise. 
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Fig. 2. Normal mixture example. Size-fixed variance reduction factor (left) and 
time-fixed variance reduction factor (right) plotted against the number of parallel chains 
for different functions. Note the values of the time-fixed VRF for PD at k = 15 are too 
large for the plotting range. 
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Let m be the number of zjj 's equal to and let wt = {wt r ,r = 1, . . . , n + 
2}, where the u^ r 's are i.i.d. samples from an exponential distribution 

with mean 1. Compute p^i = 4>(pt ,«t, ™t) = 2~2™=i w tr/ E"=l w tr- 
3. If pg 1 ^ = = po, then po is our sample. If not, set T i<j = T and go to step 
2 with T = 2T Q id while keeping uu,wt r for all — T id < t < 0, 1 < i < n, 
1 < r < n + 2 unchanged. 

Although ip(p,u,w) of step 2 is not monotone (in the same direction) in all 
its arguments, the variance reductions obtained with antithetic variates are 
still significant in all the simulation examples we have looked at. Figure 2 
gives an illustration based on 0.33-AA(3.2, 3.2) + (1 - 0.33)^(1.4, 1.4). The 

first two columns in Figure 2 plot the size-fixed VRF S k against k. The 
plots on each row correspond to a different generating method: the permuted 
displacement (PD) method, Latin hypercube sampling (LHS) and iterative 
Latin hypercube sampling (ILHS); see Section 4. The simulation size here 

is 7500 for each k, and the Monte Carlo variance for estimating Su is 

— r (f) 
on the order of 10 . We see that Sj: decreases from 0.5-0.6 with k = 

2 to 0.2-0.3 when k > 6 (all numbers are from the ILHS method, which 

performs best). It appears that Sj: stabilizes after about k = 10. The second 
column shows the results for three nonmonotone estimand functions. The 

theoretical guarantee given in Section 3 does not apply to such functions, 

( f) ( f) 

but nevertheless all Si: < 1. Even for f(x) = sin(5x), Sj. is around 0.4, 

implying a 60% reduction in variance when k > 6. 

The last two columns in Figure 2 plot the time-fixed VRF, of (2.5). 

From the first row, it is seen that using k > 10 becomes too costly for PD, 

at least in our implementation. In contrast, the monotone patterns for the 

( f) 

second and third rows resemble that for Sj: , albeit the reduction is less 

because the Cj~ factor tends to be more than 1. Nevertheless, we still see 
( f ) 

that reduces from 0.8 with k = 2 to about 0.5 when k > 6, and the em- 
pirical finding that using k > 10 is not practically beneficial remains true. 
These conclusions are based on the best performing method, LHS (ILHS is 
obviously more costly because of the iteration). However, we emphasize that 
in our implementation we have made no attempt to optimize our code. It 
is thus important to separate the gain in statistical efficiency, as measured 
by Sj: , which does not depend on the particular implementation, from the 
possible offset, as measured by C^, which depends critically on the particu- 
lar implementation and thus could be further reduced with a more refined 
code / implementation. 

2.4. Forward antithetic coupling and slice sampling. The next applica- 
tion used for illustration is a forward slice sampler. A graphical scheme of the 
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antithetic principle applied to forward chains is shown in Figure 3. In this 
situation the parallel chains are started from different points in the sample 
space. After the so-called burn-in period, the realizations of each path are 
used, typically as positively correlated samples from the stationary distribu- 
tion of the chain. Like CFTP, we can update k chains using k antithetically 
coupled variates. However, unlike CFTP, the within-chain autocorrelation 
for the forward implementation makes the determination of the VRF a bit 
more complicated. Specifically, we need to generalize (2.4) to 



S, 
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(2.6) 
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where JtitrJ an< ^ Pt\trhh are res P ec tivery within-chain and between- chain 
autocovariance, that is, 



(2.7) 
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cov(/(x t ( f } ; 



In many applications expression (2.6) can be greatly simplified because of 
the ergodicity of the antithetically coupled joint chain (which is not an auto- 
matic consequence of the ergodicity of the marginal chains; see Section 3.2). 
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Fig. 3. Parallel antithetic forward MCMC chains. 
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Here we retain the full generality of (2.6) to show that as long as all the 
between-chain covariances are nonpositive, the use of antithetic coupling will 
always guarantee SI < 1 regardless of the mixing properties of the individ- 
ual chains (e.g., still in burning periods) or its within-covariance structure — 

( f) 

that is, using antithetic coupling cannot hurt as long as (3l lt2 .j 1 j 2 < for all 
(*ij t2, ji, J2); see Section 3 for conditions to guarantee this. The definition 
of time-fixed VRF remains the same as in (2.5) because Ck is unaffected by 
the autocorrelation. 

To illustrate possible gains, we adopt an example of Damien, Wakefield 
and Walker (1999), who use slice sampling [Neal (2003)] to draw from tt(x) oc 
x 2 e~ e I{ x >o}- The resulting Gibbs sampler has the updating function, 

(2.8) X t+l = 1>(X U 6, 6) = £i /3 log(e Xi - log(l - 6)), 

where £1 and £2 are i.i.d. Uniform(0, 1). Since ^(^j £i> £2) is nondecreasing, 

( f) 

Theorem 1 of Section 3 guarantees that S k < 1 for any monotone function 
/. Note (2.8) is an example where ip is made to be nondecreasing by using 
1 — £2 instead of £2 inside the second logarithm. 

Figure 4 is the counterpart of Figure 2 for the forward case, with sim- 
ilar general patterns (the simulation size here is 5000). In particular, for 
( f) 

monotone functions Sr decreases from between 0.35-0.45 with k = 2 to 

0.1-0.15 when k > 6. For highly nonmonotone f(x) = sin(5x), in contrast 
( f) 

to k = 2, Su < 1 when k > 3. For less variable nonmonotone functions 

f(x) = 2x(l + x 2 )" 1 and f(x) = x(l — 5x), Sff goes below 0.5. Partial the- 
oretical support for this phenomenon can be found in Owen (1997), whose 

( f) 

result implies that, under LHS, S k <k/(k — 1) for any square-integrable /. 

(f) (f) 
This suggests that there is more room for Sr, to exceed 1 than for Sr with 

k > 2. However, in terms of the time-fixed VRF, the clear winner is the LHS 
method with monotone functions. For highly nonmonotone functions, our 
implementation becomes too costly, suggesting that if such estimand func- 
tions are of main interest, then it would be generally safer to just use the 
independent implementation of multiple chains, as recommended in Gelman 
and Rubin (1992). 

2.5. A real-data application with Bayesian probit regression. We con- 
clude our empirical investigation by presenting a multidimensional real-data 
application. The data are taken from van Dyk and Meng (2001) and consist 
of measurements on 55 patients, of whom 18 have been diagnosed with la- 
tent membranous lupus. Table 1 shows the data with two clinical covariates, 
IgA and IgG, that measure the levels of immunoglobulin of type A and of 
type G, respectively. Of interest is the prediction of disease occurrence using 
the two covariates IgG3 — IgG4 and IgA. As in van Dyk and Meng (2001), 
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Fig. 4. Slice sampling example. Size-fixed variance reduction factor (left) and time-fixed 
variance reduction factor (right) plotted against the number of parallel chains for differ- 
ent functions. Note the change of scale in the vertical axis from the sized-fixed VRF to 
time-fixed VRF. 



we consider a probit regression model. For each patient i, we model the dis- 
ease indicator variables as independent Yi ~ Bernoulli^ (x^/?)), where $(•) 
is the CDF of N(0, 1), x% is the vector of the covariates and j3 is a 3 x 1 
vector of parameters. We want to sample from the posterior distribution 
corresponding to the flat prior for (3. 

To illustrate the impact of antithetic coupling on mixing, we adopt the 
standard Gibbs sampler with the latent variables ifti ~ N(x[f3, 1), of which 
only the sign Yi is observed, as the augmented data [e.g., Albert and Chib 
(1993)]. Let X be the nxp matrix whose ith row is X\ and tp = (ipi, ■ ■ ■ , Y>n)i 
for our example n = 55 and p = 3. The resulting Gibbs sampler alternates 
between sampling from (i) j3\ip ~ N(J3, {X T X)~ l ) with (3 = (X T X)- 1 X T ip 
and from (ii) fclp, Y { ~ TN(xf[3, 1, Y-), where TN(fi, a 2 ,Y) is JV(jt, a 2 ) trun- 
cated to be positive if Y > and negative if Y < 0. Our parallel chains are 
then coupled antithetically at each of these two updating steps. 
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Table 1 

The number of latent membranous lupus nephritis cases, the 
numerator, and the total number of cases, the denominator, for 
each combination of the values of the two covariates 
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In the first step we want to generate exchangeable ■ . ■ ,(3^} with 

E W for all l<j<k, such that marginally /#) ~ N((3, S = {X T X)- 1 ) 

and Corr(/3- ) is minimum possible, for all 1 < i < p and all 1 < 

j 7^ j' < k. This can be realized by generating p i.i.d. multivariate normal 
(Zf \ . . . , zf } ) T such that Z\ j) ~ JV(0, 1) and Co^(z\ j \z\ j ' ] ) = -l/(k- 1). 

We then let = X 1 / 2 ^') + /3, where = (z[ j) , . . . , Z ( p j) ) T and S 1 / 2 is 
the Choleski decomposition of E. 

In the second step, we use the inverse CDF method suggested by Gelfand, 
Smith and Lee (1992) to sample from TN(fi,a 2 ,1). Namely, we simulate 
U ~ Uniform(0, 1) and then take 

(2.9) Z = ii + a^m-fi/a) + U(l - $(-/V<r))]. 

The methods of Lew (1981) and Bailey (1981) were used to approximate 
$(x) and $ _1 (x) needed for (2.9). The antithetic coupling is then realized via 

(1) (k) 

n independent vectors of NA uniform random variables {U- , . . . , U- ; 1 < 
i < n} so that for the jth chain ip^ is generated by using U = u\ j) in (2.9). 
We used only two iterations of ILHS, to increase the speed of the antithetic 
sampler. The computational overhead is small even in the first step as all 
the matrix inversions required there were performed once outside the inner 
loop of the stochastic algorithm. 

The improvement brought about by the antithetic coupling is twofold 
because it can reduce both the variance and the bias of the original sampler. 
Van Dyk and Meng (2001) demonstrate the slow mixing of the standard 
algorithm and propose a much more reliable marginal data augmentation 
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algorithm. Our antithetic implementation does not, and cannot, remedy 
all the problems with the original sampler, but it can provide noticeable 
improvement without requiring a new algorithm or a substantial increase in 
the computational load. This is seen in Figure 5, which contains scatter plots 
of (/3o,/3i) using draws from k independent chains of the original sampler 
(left column) and from k antithetic chains (right column). The true contour 
plots, obtained via numerical methods, are from van Dyk and Meng (2001). 
Each scatter plot is based on 9000 sample points, divided equally among 
the k chains used. All the starting points were set to the same MLE; we did 
this on purpose to see to what extent the antithetic coupling can "spread 
out" despite the fact that all chains are started at the same point. Although 
the antithetic chains still have the serious problem of missing a good part of 
the "right tail," a problem that was avoided by van Dyk and Meng's (2001) 
algorithm, in all rows the scatter plot in the right column extends to the 
"right tail" no less, and sometime substantially more, than the scatter plot 
in the left column. 

To represent quantitatively the benefit of antithetic coupling, Figure 6 
plots, against k, the relative bias, relative standard error (SD) and relative 
root mean square error (RMSE) from antithetic chains, in estimating six 
posterior quantities, with the results from independent chains as the baseline 
(e.g., a relative bias 0.4 means the antithetic implementation reduces the 
bias, in magnitude, by 1 — 0.4 = 60%). The six quantities we choose are, in 
the order of the rows in Figure 6, E[(3 \V], E[(3i\V], Var[/3 |£>], Var[/3i|P], 
E[-p /Pim and E[Q\V], where Q = - 0.5(3 2 + 1.5/3 2 )]/[l - $(/3 - 

0.5/?2 + 1.5/32)] an d T> denotes the observed data (X, Y). Whereas the first 
four are usually required in a Bayesian analysis, the last two are specific 
to the example at hand: — /3o//?i is the so-called LD50 level (i.e., with 50% 
chance the corresponding dosage becomes lethal) for covariate x\ when x 2 = 
0, and Q represents the odds of having the disease when x\ = —0.5 and X2 = 
1.5. In particular, four out of these six estimand functions are nonmonotone. 

The simulation sizes in Figure 6 are the same as in Figure 5, and the 
reduction factors are then computed using 1000 replicates of the simulation 
process. No burn-in period was discarded because we are interested in com- 
paring the mixing properties of the two implementations. Because the total 
sample size is fixed at 9000 for each plot, using a larger k means a smaller 
within-chain sample size m, and hence possibly more strong influence of the 
starting point. To investigate the effect of the starting point, we use three 
starting strategies: 1. MLE (columns 1-3 in Figure 6). 2. At two standard 
deviations (SD) from the MLEs, in this case, (3q + 2SD and 0i — 2SD, i = 1,2 
(columns 4-6). 3. From extreme points (more than three SD away from the 
MLE) in the support of the distribution (columns 7-9). In general it is seen 
that the RMSE is reduced due to a decrease in variance, bias or both. Best 







Fig. 5. Bayesian probit regression example. Scatter plots of (/3o,/3i) generated by using 
k independent chains (left) and k antithetic chains (right) for different values of k; the 
contour lines are from the targeted bivariate posterior distribution. Each plot contains 9000 
draws. 



results correspond to 3 < k < 6, reflecting the aforementioned trade-off be- 
tween k and m. As expected, the reduction factors vary with the starting 
points, as well as the estimand of interest. Nevertheless, with k > 3 the an- 
tithetic coupling in general does not innate the RMSE and may result in 
reduction factors as low as 0.1, that is, up to 90% of saving. 

3. A theoretical foundation. 

3.1. Negative association and dependence. In general, a qualitative mea- 
sure of negative dependence should adequately reflect the following intu- 
itive behavior among a set of variables: if one subset of the variables is 
"high," then a disjoint subset of the variables is "low." Different ways to 
define such negative dependence have received a great deal of attention 
in the last twenty years or so. Due to the success of the positive associ- 
ation concept of Esary, Proschan and Walkup (1967), the main challenge 
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Fig. 6. Relative Monte Carlo bias, standard error (SD) and root mean squared error 
(RMSE), from the antithetic chains, relative to that of independent chains, as functions 
of k, the number of parallel chains. Same simulation configurations as in Figure 5, and 
with 1000 replication for each k. Each row represents a different estimand function; from 
top to bottom: E[f3 \V\; E[/3i|D]; Var(/3b|2>); Var(/3i|X>); E[-/3 //3i \V]; E[Q|£>], Columns 
1-3 use MLE as the starting points, 4-6 use MLE ±2 SD, and 7-9 use some extreme 
points situated more than three SD away from the MLE. 



has been to build the negative association concepts as "duals" of the pos- 
itive ones, but so far there has not been a universally acceptable con- 
struction [e.g., Pemantle (2000)]. Specifically, the set of random variables 
{Xi, . . . ,X n } is said to be positively associated (PA) if for any nondecreas- 
ing (or nonincreasing — we will not state both hereafter) functions /i,/2, 
Cov(/i (X 1 , X 2 , . . . , X n ), f 2 {X 1 , X 2 , . . . , X n )) > 0. Here a function / : R" R 
is called nondecreasing if it is nondecreasing in all of its arguments. 

The closest negative counterpart of positive association we can find is the 
negative association concept introduced by Joag-Dev and Proschan (1983). 



Definition 1. The random variables {Aj}i<j< n , where each Xi can be 
of arbitrary dimension, are said to be negatively associated (NA) if for every 
pair of disjoint finite subsets A\,A 2 of {1, 2, ... ,n} and for any nondecreasing 
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functions fa, fa 

(3.1) Cov(/i(X i5 ie AxlMXjJ g A 2 )) < 

whenever the above covariance function is well defined. 

The following equivalence result is important for some of our subsequent 
theoretical investigation. 

Proposition 1. The random variables {Xj}i<j< n are NA if and only 
if (3.1) holds for every pair of nonnegative, bounded and nondecreasing func- 
tions fi and fa. 

Proof. The "only if" part is obvious. To prove the "if" part, let f m (Xi,i G 
A m ), m = 1,2, be the two functions in (3.1). For any positive integer I, 

let fm be the truncation of f m to [—/,/], that is, fm = f m when |/ m | < 
/, and = db/ depending on f m > I or f m < -I. Clearly, \f$\ < \ f m \, 
and thus by the dominated convergence theorem, lim^oo Cov(/}^(Xj, i G 
AJJ^iXjJ G A 2 )) = Cov(/i(Xi,z G A 1 ),/ 2 (X i ,j G A 2 )). This allows us 
to conclude (3.1) because 

CavifP&uieAJjPiXjJ G A 2 )) 

= Cov(fi\Xi,i G Ai) + 1, /f (Xj, j G A 2 ) + 1) < 0, 

where the last inequality holds because fm + I is a nonnegative, bounded 
and nondecreasing function for m = 1, 2. □ 

The notion of NA is most useful for our purposes primarily because, 
like PA, it is closed under the independent union operation as well as mono- 
tone transformations, as proved in Joag-Dev and Proschan (1983). Specifi- 
cally, we have: 

Proposition 2. If{X\, . . . ,X ni } and {Yi, . . . , Y n2 } are two independent 
sets of NA (PA ) random variables, then their union, {X±, . . . , X ni , Y\, . . . , Y n2 }, 
is a set of NA (PA) random variables. 

Proposition 3. If {Aj}i<j<„ is a sequence of NA (PA) random vari- 
ables and (ipi)i<i< n ar e 0,11 nondecreasing functions, then (V'i(^j))i<i<n is 
a sequence of NA (PA) random variables. 



These two results enable us to prove the following fundamental result on 
antithetic coupling of homogeneous and nonhomogeneous Markov chains. 
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Theorem 1. Suppose we run a k-process antithetically coupled chain, 
X t k = {X^\. . .,x[ k) ], forward f or T iterations, where x[ j) = ^(X^U^), 

j = 1, . . . , k, and U k = {Z7 t , . . . , U^}, t = 1, . . . , T, are T independent sets 
of NA variables. Assuming t^t(X, U) is nondecreasing for all t <T, we have 
the following results. 

(i) The k-tuple {X^ , xj®} is a collection of k NA variables for any 
{ii, . . . , ifc} £ {0, 1, . . . , T} k if and only if it is so for t\ = ■ ■ ■ = = 0. 

(ii) Assuming Xq is a set of NA variables, then the variance reduction 

( f) 

factor, Su , defined by (2.6), is at most 1 for any monotone function f, 

and it is strictly less than 1 if and only if at least one of the between-chain 
( f) 

covariances, Pt 1 t 2 -j 1 j 2 °f * s strictly negative. 

Proof. For (i) the necessity holds by definition. For the sufficiency, be- 
cause {X^ , . . . , X^} and {U^ , . . . , U^ k) }, t = 1, . . . , T, are T + 1 sets of 
independent NA variables, by Proposition 2 their union is also NA. Conse- 
quently, because xj? is a nondecreasing function of = {Aq , ?7 t , i = 
1, . . . , T} only, and Z^ 1 and Z^ do not share any common variable for i^j, 

by Proposition 3 {X^ xj®} are NA for any tj < T, j = 1, . . . , k. 

( f) 

For (ii), by (i) all the between-chain autocovariances Pi\_ trh h — ^' wmcn 

( f) 

implies Si < 1 because the denominator in (2.6), being a variance, is always 
positive. (In fact, by using the PA part of Propositions 2 and 3, we can 

conclude all Jti t 2 -j — ^0 Furthermore, because all Pi x trhh — ^' ^ 

(f) (f) 
S k = 1 if and only if all Pi x trh are zero - ^ 

Since for a Gibbs sampler with attractive stationary density the updating 
function can be expressed as a monotone function [e.g., Propp and Wilson 
(1996) and Haggstrom and Nelander (1998)], Theorem 1 covers Frigessi, 
Gasemyr and Rue's (2000) Theorem 1 with k = 2. For practical purposes, the 

requirement that Xq = {X^\ . . . ,Xg } are NA is immaterial, because being 
fixed (even with the choice that Aq 1 ^ = • • • = Xq ) or more generally being 
independent is a trivial case of being NA. It is also evident that Theorem 1 
does not require Aq to be from the stationary distribution. 

Theorem 1, however, does require that the random variables in the fc-tuple 
Lit = {U^\ . . . , U^} are NA. Typically, for a particular distribution there 
are many ways of achieving this, and some general recipes are described in 
Section 4. A useful construction in practice is to use the fact that mono- 
tone functions of distinctive subsets of NA variables are NA, a fact that 
allows us to build upon known NA variables, such as permutation distribu- 
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tions, multinomial, multivariate hypergeometric, Dirichlet and multivariate 
normals with nonpositive correlations [see Joag-Dev and Proschan (1983)]. 

We remark that part (i) of Theorem 1 is actually stronger than we need, 

( f) 

because from (ii), in order to assure Sf. < 1, we only need to ensure pair- 
wise negative covariance: Cow{f{X { t { l) ),f(x[i 2) )) < 0. This weaker version 
is sometimes easier to establish, and thus we define the following notion. 

Definition 2. The random variables X\,X2, ■ ■ ■ ,X n are said to be 
pairwise negatively associated (PNA) if {Xi,Xj} are NA for any i ^ j £ 

Clearly, Propositions 2 and 3 still hold for PNA. In the next section, we 
will see that only requiring our outcome to be PNA can avoid technical 
complications that are not critical to MCMC applications in practice. In 
particular, the PNA property is easier to verify because it is equivalent 
to the negative quadrant dependency (NQD) notion for a pair of random 
variables, as defined by Lehmann (1966). For more than two variables, NQD 
is a consequence of NA, but not vice versa, as formalized by the following 
result, also due to Joag-Dev and Proschan (1983). 

Proposition 4. Let Xi,X 2 , . . . ,X n be a set of NA random variables. 
Then they are also negatively lower orthant dependent (NLOD), that is, 

n 

(3.2) P(Xi<xi,...,X n <x n )<Y[P(Xi<Xi) for all xi,...,x n , 

i=l 

as well as negatively upper orthant dependent (NUOD), namely, 

n 

(3.3) P(Xi>x 1 ,...,X n >x n )<Y[P(X i >x i ) for all x 1: ..., x n . 

i=l 

3.2. Negative association in limit and joint convergence. Although The- 
orem 1 provides a theoretical foundation for antithetic coupling with forward 
MCMC, it does not cover backward MCMC such as CFTP. This is because 
the T in Theorem 1 is a nonrandom constant, whereas for CFTP it is a 
random variable and, more importantly, it is not independent of the cor- 
responding draw from the CFTP. We thus need to extend Theorem 1 to 
the case with T = +oo, namely, we need to prove that the limiting /c-tuple 
{xQ , . . . , X$ } is still NA, or at least PNA. This would entail that the joint 
draw from a CFTP, {X^ X^ k) }, is NA/PNA because we can identify its 
probability structure with that of {x£) , . . . ,X$} from a forward process. 
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The extension to T = oo is not straightforward because the ergodicity of 
the marginal chain does not guarantee that of the joint chain, so that a limit- 
ing argument such as Cov(/(A^), f{X&)) = lim^ Cov(/(X t W ), f{x\ j) )) < 
needs qualification. The problem has been discussed also by Arjas and Gas- 
barra (1996) and Frigessi, Gasemyr and Rue (2000) who have shown that, if 
the joint chain is 0-irreducible and the closure of the support of <f> contains 
an open set, then the joint chain is positive recurrent on the closure of the 
support of (ft. This can be used to prove that the above convergence of covari- 
ances holds. However, the assumption that the support of the irreducibility 
measure (ft has nonempty interior is violated even by simple examples such 
as the following Markov chain: take Xq G Si, the unit circle; for t > 1, draw 
0± uniform on [0, ir) and construct the line that goes through Xt-i and has 
slope tan(^). The intersection of this line with the unit circle is Xt. The left 
panel of Figure 7 illustrates this construction with t = 1. The right panel 
illustrates a pair of antithetically coupled chains (Xt,Yt) (for t < 2) where 
the Y t is constructed the same way as X t except each time X t is updated 
using 9t, Yt is updated using ir — Of. Algebraically, we have 

Of = (-Ci + + 20 t ) mod(2vr), 

(3.4) 

6>f = + vr - 2Q t ) mod(2vr), 

where 6f and 6j are the polar-coordinate representations of X t and Y t , 
respectively. Marginally, Of (and hence Oj) are i.i.d. uniform variables on 
[0, 27r), and hence {Xt}t is trivially uniformly ergodic. However, the joint 
state space of (0f,0Y), S2 = [0,27r) x [0, 2ir), consists of uncountably many 
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absorbing subclasses, denned by 

(3.5) S(t) = {(9 x ,9 y )£S 2 :6 x + 9 y = (2k + 1)tt ±2t, fork = 0,1}, 

where r G [0,tt/2] is the acute angle between the line XqYq an d the horizontal 
diameter (see Figure 7). Here S(t) is an absorbing set because the joint 
updating rule leaves the acute angle between the line XtYt and the horizontal 
diameter, denoted by r t , unchanged. Let pt(r) = Corr(X t , Y^), where Z^> 
denotes the X-axis coordinate of Z [i.e., = cos(0^)]. Then, as proved 
in Appendix A.l, pt{r) = — cos(2r), increasing from —1 to 1 as r increases 
from to 7r/2. In addition, for each of the ergodic classes, the joint chain 
has an irreducibility measure whose support has an empty interior. 

This example shows that the uniform ergodicity of the marginal chain is 
not sufficient to guarantee the ergodicity of the antithetically coupled chain 
or to directly justify the extension of the NA properties from { x\ ^ , . . . , x[ k ^ } 

to its "limit" {x£, , . . . ,xj£ }■ Here the "limit" is in quotes because while 

(i) 

the marginal distribution of X^ does not depend on the starting point, the 

(1) (k) 

joint distribution of {Aoo , . . . , Xso } may depend on it, as is the case in the 
previous example. Fortunately, the time-backward dual sequence approach 
discussed in Section 2.1 provides a theoretical tool to bypass the issue of the 
joint chain's ergodicity, and thereby to establish the following counterpart 
of Theorem 1 for justifying backward antithetic coupling for all monotone 
CFTP algorithms. 

Theorem 2. Let Xt = ij)(Xt—i, Ut) be a uniformly ergodic Markov chain 
on state space T with ir being its invariant distribution and with ip nonde- 

creasing. Let X* = {x[ l \ . . . , X^} be the corresponding antithetically cou- 
pled joint chain as in Theorem 1. Then: 

(i) For any given starting point Xq , X^ = {x[ l \ . . . , X^} converges in 

distribution to some X^ = {Xst , . . . ,xlx }> whose joint distribution may 
depend on Xq . 

(ii) For any nondecreasing real functions f m , m = 1,2, on T such that 
Var 7r (/ m (X)) < 00, m = 1,2, and such that their discontinuity points are 
contained in a ir-null set, we have 

(3.6) Coy(f 1 (X^),f 2 (X^)\X^) < for any j, jkj 2 . 

(i) ~ (i) 

Proof. For each marginal chain X t , let X\ be its dual sequence as 
in (2.2). Then, by the equivalence of the implement ability of CFTP and uni- 
form ergodicity [Foss and Tweedie (1998)], we know that there exists an al- 
most sure finite stopping time S on the infinite-product space = Y\ t >i^t 

such that X^ ~ vr and A t (j) = X^ for all t > S and all j. Therefore, by 
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re-expressing all relevant variables on their common sampling space Ty , 
we can define X^uj) = {X^/^Jlj), . . . , X^)ju;)}, which is a well-defined 
/c-component joint random variable on T^j because P(S(uj) < oo) = 1. Pur- 

~ h ~ (1) ~ (k) h 

thermore, because X* = {X\ , . . . , X\ } converges to X^ with probability 1 
by the construction, and because Xj* and Xj* have the same distribution for 
any t, Xf must converge in distribution to X^, and hence (i). 

For (ii), we only need to prove (3.6) when both /'s are bounded, following 
the same argument as in the proof of Proposition 1. Since f m ,m = 1,2, are 
almost surely continuous with respect to the distribution of , X$} be- 
cause its margins are ir, we can conclude by part (i) that {fi(X^), f2(xjr*^)} 
converges m distribution to {fi(X$)J 2 (X&)}. This implies 

(3.7) Cov(/ 1 (xW),/ 2 (JTCj))|A*)= hm Cov(/i(X t (i) ), h{X?)\X*) < 0, 

t— >oo 

where the limiting argument holds because both /'s are bounded. □ 

Part (ii) of Theorem 2 shows that {xj^} , . . . ,X^} are PNA when the 
monotone functions are almost surely continuous with respect to the under- 
lying dominating measure, which is the case if the latter is the Lebesgue 
measure, as is common in Bayesian computation. We emphasize that the 
condition that ip is monotone is not needed for part (i), but it is important 
for part (ii). This is seen in the unit circle example, where (i) holds with 
uncountably many limiting distributions for the joint chain, depending on 
the initial r. However, (ii) does not hold because the ip function there is 
obviously not monotone. The fact that p(r) < when t < tt/4 in that ex- 
ample also illustrates that the monotonicity is a sufficient but not necessary 
condition for preserving negative correlation. 

3.3. Extreme antithesis. The concept of NA provides a qualitative de- 
scription of negative dependence. Quantitatively, it is desirable to generate 
{X u ...,X k } (corresponding to {X^, . . . ,X^} in previous sections) such 
that Coii(f(Xi),f(Xj)) is as negative as possible. Formally, we define the 
following notion. 

Definition 3. A set of variables {Xi,X2, . . . ,X k } is said to achieve 
extreme antithesis (EA) with respect to a (marginal) distribution F if they 
are exchangeable and 

Coxx(Xi,Xj) = min{Corr(yj, Yj) : Yi, . . . , Yfc exchangeable, Yj ~ F Vi}. 

For k = 2, the single strategy of using quantile coupling via X\ = F~ 1 (U) 
and X2 = i ? ~ 1 (l — U) achieves EA for any F, as discussed in Section 1. For 
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k > 2, the matter is much more complicated, even just for F := Uniform(0, 1) 
and F := N(0, 1). Indeed, it is not hard to establish the following negative 
result [see Craiu and Meng (2002) for a proof], where $ is the CDF of 
N(0,1). 

Proposition 5. It is impossible to find a joint distribution F^ 3 '(Ux, U?, Us) 
on (0, l) 3 such that all its univariate margins are Uniform(0, 1) and the fol- 
lowing holds almost surely (with respect to joint Lebesgue measure): 

(3.8) Ut + U 2 + U 3 = 3/2 and ^(Ui) + ^~\U 2 ) + ^\U 3 ) = 0. 

For k even, say, k = 4, one may be tempted to use two pairs of quantile 

coupled variates, namely, U 2 = 1 — U%, U4 = 1 — U 3 , where U\ and Us are i.i.d., 

( f) ( f) 

to make the k = 4 version of (3.8) hold. Consequently, for any / p\ — p\ , /3, 

( f) ( f) 

where p 2 = Corr(/(£7i), f(l — U±)) and p 4 is the correlation between any 

pair of {f(Ui),i = 1, ... ,4}. This implies = 1 + 3p { / ] = 1 + p [ 2 f) = 
and thus it is just a disguised version of quantile coupling with k = 2, not a 
real generalization to k = 4. 

For lack of a universal strategy, we seek methods that are effective in 
common practice. In the exchangeable setting, the quest for EA is the same 
as that for minimizing the variance of the mean (and sum) = (X\ + 
• • • + Xk)/k because [see (2.3)] 

Ysx k (X^>) 



(3.9) p k = CoTT{Xi,X- 



k-1 



Var(Xi 



1 



1 

> . 

- k-1 



where the subscript k in both p k and Yax^X^) emphasizes the depen- 
dence on the joint distribution F^(x\, . . . ,£&). In particular, if an is 
constructed such that X^' = constant (almost surely), then EA is achieved 
[hence (3.8)]. However, although this approach works for some common dis- 
tributions such as uniform and normal, it is not always possible because the 
minimal value of Var^A"^)) may not achieve zero even when k = 2. For 
example, the minimal p2 is 1 — 7r 2 /6 = —0.645 when both Ai and A2 are 
exponentially distributed with mean 1, as reported in Moran (1967). 

For specific families of distributions, several methods may be available 
to achieve EA. Snijders (1984) explores various approaches for binary ran- 
dom variables. For a unimodal symmetric and differentiable density p on 
Ft, Riischendorf and Uckelmann (2000) propose the following. Suppose the 
center of symmetry is zero and that pq(x) = —xp'(x) is also a Lebesgue 
density on R; let Q ~ pq. Then A = QU ~ p for any U ~ Uniform(— 1, 1) 
that is independent of Q. Consequently, for any set of {U±, . . . , U k }, in- 
dependent of Q, such that U ~ Uniform(— 1, 1) and J2i=iU = 0, {Ai = 
QUi, . . . , Afe = QUk\ achieves EA with respect to p because Ya=x Xi = 0. In 



MULTIPROCESS ANTITHETIC COUPLING FOR MCMC 



23 



fact, Corr(Xj, Xj) = Corr(c^, Uj) = —{k — for any i ^ j. This construc- 
tion, however, does not guarantee NA in general. As an alternative, we can 
draw i.i.d. {Qi, ■ ■ ■ , Qk} from pq, and then use Xi = QiUi, Vi £ {1, . . . , k}. 
This will guarantee the NA property, but it sacrifices the EA property, be- 
cause now we have Corr(Aj, Xj) = — + CV 2 (Q)]~ 1 , where CV(Q) is 
the coefficient of variation of Q. Nevertheless, when CV(Q) is small, the 
loss of EA may be negligible for practical purposes. In general, if one has 
to make a choice between NA and EA, we recommend choosing NA, for it 
is preserved by all monotone transformations. With NA in place, the vari- 
ance reduction factor is guaranteed to be at most 1 when the monotonicity 
assumption holds. 

4. Generating antithetic uniform variates. Since generating uniform vari- 
ates is, explicitly or implicitly, the most basic component of almost any sim- 
ulation method, in this section we compare several methods for generating 
fc-antithetically coupled uniform variates. For each method, we investigate 
whether it leads to NA and/or EA variables, and propose remedies whenever 
possible if it does not. We emphasize that although there are many different 
ways of achieving NA and/or EA [e.g., Bondesson (1983) and Gerow and 
Holbrook (1996)], no method dominates when k > 3, as demonstrated by 
Proposition 5. Any of these methods can be more suitable than others in a 
particular application. Nevertheless, the three methods described below are 
more or less representative of what has been proposed in the literature. 

4.1. The permuted displacement method. This is a modified version of 
the one documented in Arvidsen and Johnsson (1982), which first generates 
an n ~ Uniform(0, 1), and then constructs 

(4.1)ri = {2^ 2 n + |}, i = 2,...,fc-l, and r k = 1 - {2 fc - 2 n}, 

where {x} is the fractional part of x. We find that the binary representation 
of this method makes it a bit easier to show that Y^t=i r i = k/2. Specifically, 
let n = (a±, aii ■ ■ ■ ,a m , . . . ) denote the (nonterminating) dyadic expansion 
of n, where ai = or 1, that is, r\ = J2iLi Then 

r 2 = (1 - ai,a 2 ,a 3 , . . . ,a m , . . .), 

r 3 = (1 - a 2 ,a 3 ,.. . ,a m +i, ■■■), 

r k-l = (1 — a-k-2,(lk-l, ■ ■ ■ ,«m+fc-3> ■■■)> 

r k = (1 - Gjfc-l, 1 — Ofe, . . . , 1 — a m+ k-2, ■■■)■ 

Therefore, the method creates negative correlation by displacing the binary 
digits of r\ . That all r i 's are uniform is a direct consequence of a well-known 
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result of Borel (1924), as discussed in detail by Billingsley (1986). The dyadic 
expansion representation shows clearly that the r's are not exchangeable. A 
consequence of that is that they are not NA when k > 3. To prove this, 
we only need to show that when k > 3, {n,^} are not NQD, and thus 
by Proposition 4 they cannot be NA. To see this, we note that for k > 3, 
r 2 — r i + 0.5 when t\ < 0.5 and r 2 = T\ — 0.5 when t\ > 0.5. Consequently, 
for < s < 0.5 and s + 0.5 < t < 1, we have 

P(ri <t,r 2 <s) = P(0.5 < n < t, n - 0.5 < s) = mm{t, s + 0.5} - 0.5 = s, 

which is larger than ts = P{r\ < t)P(i"2 < s), and therefore (3.2) is violated. 

However, it is easy to fix the nonexchangeability by using the simple 
permutation method. That is, let Sk be the set of all possible permutations 
of k objects. Pick a random a £ Sk and define U{ = r a u\. Clearly J2i^i = 
J2i r i = k/2 and thus Corr([/j, Uj) = — (k — 1) _1 for any i ^ j. Furthermore, 
it can be shown that: 

Theorem 3. For k = 3, {Ui,U2,Us} constructed by the permuted dis- 
placement method are PNA. 

The proof is given in Craiu and Meng (2002) and is omitted both because 
of space limitation and because the approach used was rather brute force. 
Indeed, we are unable to prove or disprove Theorem 3 for k > 4. Nevertheless, 
the result indicates that the exchangeability can play an important role in 
achieving NA/PNA. An astute reader might wonder how permuting indexes 
can be helpful since in MCMC our estimates typically are sample averages, 
which are invariant to the independent permutations of the sample indexes. 
However, one must keep in mind that in our use of the antithetic variates, the 
LPs are not used just once in the end, but throughout the whole generated 
sequence and the final estimates are not invariant to the permutations of all 
the indexes that occur along the sequence. This is perhaps easiest to see for 
CFTP, as each draw can depend, in principle, on an arbitrarily large number 
of independent copies of the /c-tuple of LPs. 

4.2. Multidimensional normal method. A common way to manipulate 
correlations, especially in the engineering literature, is through the multivari- 
ate normal distribution. For our purposes, we can first generate (Zi, . . . , Zfc_i) T ~ 
N(0, S) where Yly = —(k — 1) _1 if i ^ j and Sjj = 1, and then let Z^ = 
— {Z\ + Z2 + • • • + Zk—i). Finally, if we are interested in uniform deviates, 
we can use £/» = for all i € {1, ... , k}, where $ is the CDF of N(0, 1). 

The random variables {U\, . . . , U^} are NA following Joag-Dev and Proschan 
(1983), who proved that multivariate normals with nonpositive pairwise cor- 
relations are NA. The result then follows immediately from Proposition 3 
because Ui is a monotone function of Z^. 
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This method, however, does not achieve EA. Although Corr(Zj, Zj) = 
— (k — 1) , the nonlinear transformation Ui = <&(Zi) causes an increase in 
correlation. Specifically, for any {Z±,Z2} bivariate normal with correlation 
p, the corresponding correlation between &(Z\) and $(^2) is given by 



where <f> is the standard normal density. Consequently, Corr(£/i, U2) is larger 
than the minimum possible value — (k — 1) , though the loss of EA may 
not be important for many practical applications in view of (4.2). Equation 
(4.2) is a quantitative version of a result of Lancaster (1957), who proved 
that for any functions g± and <?2> I Corr(gi(Zi), 52(^2))! < \p\ as long as 
the left-hand side is finite, and the equality holds if and only if gi{z) = z, 
i = l,2 (almost surely). Lancaster's result also implies Proposition 5 if we 
require {Z±, Z2, Z3} to be jointly normal. Nevertheless, Proposition 5 is a 
stronger result, as it shows that it is impossible to find any such trivariate 
distribution, not just trivariate normal, to simultaneously preserve EA as in 



An issue with real impact in computation is the requirement for a highly 
reliable and efficient subroutine to evaluate the function $. Otherwise, the 
use of a not-highly accurate approximation becomes problematic in large 
replications with arbitrarily many arguments because once in a while \Z\ 
can be too large for <&(Z) to be evaluated appropriately. For that reason, 
we did not use this method to simulate uniform deviates in the simulated 
examples in Section 2. However, we used it to generate antithetic truncated 
normal deviates in the probit example presented there. 

The normal method is also of interest because it is one that many will 
likely attempt, especially for generating antithetic normal deviates in high- 
dimensional settings. An example where such implementation results in ac- 
celeration of a classical MCMC method is provided by Craiu (2004) in the 
context of Multiple- Try Metropolis with antithetic proposals. 

4.3. Iterative Latin hypercube sampling. The method described in Sec- 
tion 4.1 achieves EA but whether it achieves NA is an open question, and 
the method given in Section 4.2 achieves NA but not EA (when used for 
generating uniform variates). To achieve both, we propose to use iterative 
Latin hypercube sampling (ILHS), which is an iterative version of the Latin 
hypercube sampling (LHS), a well-known scheme in quasi Monte Carlo; 
see McKay, Beckman and Conover (1979), Stein (1987), Owen (1992), Loh 
(1996), Iman (1999) and Helton and Davis (2003), among others. 

For any given k > 2, our iterative procedure can be described by the 
following steps: 



(4.2) 




= 0.955p + 0.01p 3 + 0(p 5 ), 



(3.8). 
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1. Set = (£/ (1) , . . . , u{, k) ) T , where {U^}^^ are i.i.d. Uniform(0, 1). 

2. For t = 0,1,2, ... , let /Q = (ot(0), . . . , 0"t(fc — 1)) T be a permutation of 
{0,1, . . . ,k — 1}, independent of all previous draws, and let 

(4.3) U * +1 = l(JC t + U*), 

where W t fc = {U^ U^ k) ) T for t > 0. 

The case of t = 1 corresponds to the original LHS. For general t we have 
the following result. 

Theorem 4. For any t>0, i,j E {1, . . . ,k}, i^j, we have: 

(i) Uf ] ~Uniform(0,l). 

(ii) Corr(^,^) = -^(l-^). 

(hi) {U£ } are NA for any finite {h, . . . , t k } ■ 



Proof, (i) For t = 0, the result obviously holds. Suppose ~ Uniform(0, 1); 
then 

[ks] 



P(ug\ <s) = P(U t {1) <ks- a t (l)) = ^-Yl P ( U < ks -j) = s 



3=0 

for any s£ (0, 1). The result thus holds by induction 



(ii) Let S t = Ej = 1 T U?. Then from the recursion formula (4.3) we 
have Var(5t) = Vax(St-i)/k i+ for all 1 < i < t, which implies, by (i) and the 

k + k(k- l)Corr([/f ) ,C/ i ( 



exchangeability of {uj; , . . . , U[ }, 



r(0 ttW\ = A_ 
k 2f 



which is just (ii). 

(hi) Because any permutation distribution is NA, (4.3) defines an an- 
tithetically coupled joint Markov chain with k marginal Markov chains. 
Furthermore, the marginal updating function from U^_\ to XJ\ is mono- 
tone. The result then follows directly from part (ii) of Theorem 1 because 
{U^\. . . , C/ (fe) } are i.i.d. and thus NA. 

□ 



For most practical purposes, Theorem 4 is all we need as it proves that 
for any T we choose, {U^\...,UP} are NA. Furthermore, as long as T is 
not too small (e.g., T > 5), they practically achieve EA because the relative 
loss of EA is k~ 2T , which approaches zero very fast (it is less than 0.02% 
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even for k = 3 and T = 5). However, in theory {Uj, , . . . , [/j* } achieves EA 

only for T = oo. This requires showing first that {U^ , . . . , U^} are well 
defined and second that they are NA. The first question is easy to answer 
but the second is not, mainly because the support of {U&> , • • • , K» } is a 
"Cantor dust" type of fractal, as investigated in Craiu and Meng (2002). 

In practice, one must choose T and k. We have already seen that large 
T results in more extreme antithesis, while the simulations performed in 
Section 2 show that as k increases, the efficiency increases too. The following 
result, proved in Appendix A. 2, further shows that, without taking into 
account the computational cost, the larger T the better. 

Theorem 5. For any monotone h £ L 2 [0, 1], the correlation Corr(/i(C/^+ 1 ), M^r+i)) 
is decreasing as a function of T. That is, for any T > 0, 

(4.4) Corr(/i(^ 1 ] 1 ), h{U^ ] +l )) < Corr^C/^), h(U®)). 

However, T and k cannot be increased indefinitely in practice. Intuitively, 
when k becomes large, the choice of T should become less important because 
a large k means we have a deep stratification even with T = 1. Consequently, 
further stratification within each stratum, which is essentially what each 
new iteration does, becomes less important. The following result, proved in 
Appendix A. 3, provides a theoretical support of this intuition by showing 
that for large k, the impact of T is negligible. 

Theorem 6. For any h e L 2 [0,1] and a fixed T, {h{U^ ] ), . . . , h(U^ k) )} 
achieves EA asymptotically as k — > +oo, that is, 

(4.5) Corr(/i(4 1} ),/i(4 2) )) = "T^T + °( k ~ 1 )- 

k — 1 

This result implies that asymptotically, as k — > oo, any ILHS iteration 
achieves EA for any square-integrable function, not just monotone func- 
tions. In practice, however, k must be finite, and often quite small for the 
sake of computational cost. For fixed k, the following result, proved in Ap- 
pendix A. 4, shows that even with k as small as 3, T does not need to be 
large in order to achieve practically the same result as T = oo, at least for 
all monotone estimand functions. 

Theorem 7. Let Dh lt h 2 (t,t + m) be the Kolmogorov-Smirnov distance 
between the two-way (marginal) joint CDF of {hi(U^ m ), ^2(4+m)} an( ^ °f 
{hi(U^),h2(U^)} ; where hi, 1 = 1,2, are nondecreasing functions. Then 

(4.6) D hlM (t,t + m) < k-^(k - l)^ t+2 l 
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This implies that if we take 

d-2log 10 (k- l) + log 10 fc 



log 10 (fc(fc-l)) 



then Dh 1: h 2 (T, oo) < 10 _d . In particular, as long as T > 5, D^^iT, oo) < 
0.0001 for any > 3. Thus, taking T between 5 and 10 is enough for almost 
any practical purpose. The generality of Theorem 7 should be emphasized 
since the same bound in (4.6) holds for any monotone hi and /i2- 

4.4. Variance reduction factors for indicator functions. As an attempt to 

find some theoretical support of the empirical findings reported in Section 2 

that the cost-effective choice of k appears to be somewhere between 3 and 

( f) 

10, we report here our findings of the theoretical value of S k as a function 
of k when / = i.i x < c \. We choose the indicator function both because of its 
analytical tractability and its practical relevance (e.g., for estimating CDF), 
and because it serves as a building block for general functions. We start with 
the LHS method, for which 

u 7) s {f) (c) -l + (k- l) F(c ' c) ~ c2 - (1 ~ {kc}){kc} 

(4.7) b k {c)-l + {k 1) c(i _ c) - kc{1 _ c) , 

where {a} denotes the fractional part of a. This result follows from the 

expression of the joint CDF, F(c,c) = P{U\ < c, U2 < c), given by (A. 6) 

( f) 

from Appendix A. 2; note (A. 5) implies that the same S): (c) holds for any 
ILHS. 

The left panel of Figure 8, realized using the freely available software RGL 

( f) 

developed by Duncan Murdoch, plots Sj. (c) as a function of both k (up to 

30) and c G (0, 1). Its fascinating shape reveals that as long as c is not too 

( f) 

close to or 1, Sj: (c) will be rather small. This is more clearly seen in the 

first two rows of Figure 9, where Sj; (c) is plotted against c for given k. It 
is intuitive that when c approaches or 1, the effect of antithetic coupling 
fades because the indicator function approaches the constant function. This 

can be formalized by considering that for 1/k < c < {k— l)/k the maximum 

( f) 

of Si (c) over this range, as shown in Appendix A. 5, is given by 

(4.8) S* k 



3k - 4 + 2 v / 2(fe-l)(/c-2) 

corresponding to the dashed lines in the first two rows in Figure 9. The 
use of a large k is seen to be twice advantageous: first, it decreases the S%, 
and second, it shrinks the area (of c) where the antithetic coupling is not 
effective. The plots also show clearly that k = 2 is least effective. 

The S* k of (4.8) decrease from S% = 1/3 to S^ = (3 + 2 V / 2)~ 1 « 0.172. 
However, the use of large k also increases the computational cost. Striving for 
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Table 2 

The minimum k a for achieving a% of maximal possible gain 
in efficiency over using k = 3 



a 


50% 


60% 


70% 


80% 


90% 


95% 


99% 




5 


6 


7 


9 


17 


32 


152 



a balance, we consider = (S3 — S^)/(S^ — S^), a measure of the relative 
gain in efficiency obtained when we increase from k = 3 to a finite k > 3 
instead of k = 00. Table 2 gives, for a few a G [0.5, 1], k a = min{fe : > a}. 

It is seen that with k = 9 we already have achieved 80% of the additional 

( f) 

gain. The rapid decrease of S^, (c) as a function of k, for fixed c, can also 
be seen in the last two rows of Figure 9. 

The above exercise can be repeated for the multivariate normal method, 
that is, when (X%,X2, ■ ■ ■ , Xk) is a multivariate normal vector with Corr(Xj, Xj) = 

— (k — and N(0, 1) margins. The calculation of Sj: is a bit more in- 
volved than in the LHS case. Specifically, with the orthogonal transformation 
Z = X\ + X2, W = X\ — X2 (i.e., Z and W are independent), we obtain 



$jt(c, c) = P(X 1 <c,X 2 <c) = P(Z - 2c < W < -Z + 2c, Z < 2c) 








Fig. 8. Variance reduction for indicator functions for Uniform(0, 1) random variables 
generated using LHS (left panel) and for normal variates (right panel). Note for better 
visualization the left panel is seen from back, with the c axis hidden. 



30 



R. V. CRAIU AND X.-L. MENG 



<f> 



2c 



where pj. = — (k — 1) 1 . Consequently, we can compute S^\c) via 

f4 10) 5 (/) f c ) - 1 + fit - n!*Mz!!M 

(4.1U) 6, {c)-l + {k l) $(c)(1 _ $(c)) - 

In the normal case, we expect a behavior of Si (c) similar to that in the 
uniform case when c is close to the limits of the range, (—00,00). However, 
for plotting purposes we use a one-to-one transformation of c, g = 3>(c), 

and we plot Si, (c) against g; this is the same as the VRF for the uniform 
variates via the normal approach, namely, if Ui = One can notice 

in the right panel of Figure 8 that everywhere but in a region of c around 

i f) 

the distribution's center, Sj; '(c) is decreasing in k. However, the decrease 
is much less abrupt than in the uniform case, which indicates that it will 




Fig. 9. Variance reduction factor for indicator function under LHS method, as a function 
of c for different values of k (first two rows), and as a function of k for different values 
of c (last two rows). 
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take a larger k to reach the same relative efficiency as given in Table 2. This 
suggests that generalizing the findings in Table tb:effl to other situations is 
by no means automatic. 

This example also shows that it is not always true that the larger k, the 
better. For example, when c = 0, using k = 2 completely eliminates the vari- 
ance because of the symmetry in / = l/ x <o}; see the right panel of Figure 8. 
This can also be verified directly from (4.10), which becomes 



S^(0) = l + (fc-l) 




k 



| 4>(z) dz 



( f) 

which is zero when k = 2. However, as k — > oo we can show that S£"(0) -» 
1 — 2/-/r = 0.3634. This reminds us that although there are good rules of 
thumb for general practice, in terms of choosing k as well as other choices 
(e.g., the generating methods) presented in this paper, one should not adopt 
them blindly without examining the special structures of the problem at 
hand. 

APPENDIX 

A.l. Proof for the unit circle example. We first note that (3.4) implies, 
for any t > 1, 

(A.l) (0f + 9j) mod(27r) = -(0*1 + Cl) mod(2vr). 

From the right panel of Figure 7, MXo = QN = QY + YoN, where AB 
denotes the counter-clockwise arc between points A and B on the unit circle. 
Furthermore, MAo = 0$ - ir, QY = 2r and YoN = 2n - 0% . Combining 
these identities, we obtain 0q + 6$ = 2r + 3tt. Consequently, (A.l) implies 
(3.5) by noting that 9? + 9j £ (0,4tt) and the "alternating" nature of (A.l), 
which is also clear from Figure 7, and thus the four possible lines in (3.5) 
are reachable from each other by (A"t, Yj). Similar arguments apply to other 
possible initial configurations of Xq and Yq (e.g., when Xq and Yq are on 
different sides of the horizontal axes). 

To prove pt(r) = — cos(2r), we observe that (3.5) implies sin 2 (#^ + 0j) = 
sin 2 (2r) for any t > 0. Using the identity sin 2 (a + (3) = cos 2 (a) + sin 2 (/3) — 
2cos(a) cos(P) cos(a + /3), we then obtain 

(A.2) cos 2 (6Y) + cos 2 (^ x ) + 2cos(0f) cos(^ x ) cos(2r) = sin 2 (2r). 

Thus the orbit of (A^ ,Yj ) = (cos(^), cos(^)) is an ellipse. Taking ex- 
pectations on both sides of (A.2) and using E[cos 2 (6 , t y )] = E[cos 2 (6' i x )] = 1/2, 
we obtain 2E[cos(0 t y ) cos(6>j )] cos(2r) = - cos 2 (2r). Together with Var(cos 2 (0 
Var(cos 2 (0^)) = 1/2, this yields pt(r) = — cos(2r) when r ^ ir/4. For r = 
vr/4, (3.5) gives 9j = (2k + l)vr ± vr/2 - 6?, and thus cos(6» t y ) = ±sm(0f ). 
Consequently, 2E[cos((9 t y ) cos(6»f )] = ±E[sin(20f )] = 0, so p t (r) = -cos(2r) 
still holds. 
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A. 2. Proof of Theorem 5. Since the (marginal) distribution of is 
Uniform(0, 1) and thus does not depend on T, by the Hoeffding identity 
(1.1), inequality (4.4) becomes immediate if we can establish 

(A.3) F$_ 1 (u,v)<fP(u,v) V(ii,u), 

where F t {h \u,v) is the joint CDF of (h(uP), h(u} 2 } )). 

We first show that (A.3) holds for h(x) = x by mathematical induction. 
Given the exchangeability, we can assume u<v, and thus [ku] < [kv], where 
[x] denotes the integer part of x. Let pij = P(J7 t < ku — i, < kv — j); 
then by the recursion (4.3), 

F t+1 (u, v) = P(U^ ] <ku- U? ) <kv- a {2) ) 
(A.4) [ku] [kv] 



k ( k -l i=0j=0 

To evaluate this expression, we consider the following possibilities according 
to the value of 

(A) When i < [ku] - 1 and j < [kv] - 1, Pij = P(C/ t (1) < 1, *7 t (2) < 1) = 1; 
there are [fcit]([A;u] — 1) + such pairs of (i, j)'s within < i ^ j < k. Recall 
(x) + = max{x, 0}. 

(B) When i < [ku] - 1 and j = [kv], Plj = P{u[ 2) < kv - [kv]) = {kv}, 
where {x} = x — [x]; there are [ku] such pairs of (i,j)'s. 

(C) When i = [ku] and j < [kv] - 1, Pij = P(U^ < ku - [ku]) = {ku}; 
there are [kv] such pairs of (i, j)'s when [ku] = [kv], but only [kv] — 1 such 
pairs when [ku] < [kv] because of the requirement that i ^ j. 

(D) When i = [ku] and j = [kv], Pij = P(U^ l) < ku - [ku],U^ 2) < kv - 
[kv]) = Ft({ku}, {kv}). There is no such pair when [ku] = [kv] and one such 
pair when [ku] < [kv]. 

Putting these four possibilities together, we have 
F t+ i(u,v) 



(A.5) 



( 0, if [kv] = 0, 

ku(kv-l)-{ku}({kv}-l) ^ if <[M = [M 

kujkv - 1) - {ku}{kv} + F t ({ku},{kv}) 

k(i~T) ' 11 [kvl > [kuV 
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From (A. 5), if (A. 3) holds for T = the rest follows immediately by induc- 
tion. For t=l, since Fo({ku},{kv}) = {ku}{kv} , we have from (A. 5) 



(A.6) Ft(u,v) 



( 0, if [kv] = 

ku{kv- 1) -{ku}{{kv} - 1) 



k(k-l) 

u(kv — 1) 
k-1 ' 



if 0< [kv] = [ku], 
if [kv] > [ku]. 



Among the three expressions above, the second one is the largest as a func- 
tion of u and v. It is easy to check that this second function is less than or 
equal to Fq(u,v) = uv if and only if 

(A.7) {ku}(l-{kv})<ku(l-v). 

Using [ku] = [kv], (A.7) is also equivalent to {kv}(u — {ku}) < [ku](l — u), 
which is obvious when u < {ku}. When < {ku} < u, {ku}(l — {kv}) < 
u{\ — {kv}). But then u(l — {kv}) < ku(l — v) is equivalent to [kv] < k — 1, 
which is obviously true except when v = 1. But the v = 1 case is trivial 
because then u = 1 in order to maintain [ku] = [kv] = k, and thus {ku} = 0, 
and hence (A.7) holds for all < u < v < 1. This proves F%(u,v) < Fq(u,v) 
for all (u,v) and hence (A. 3) by induction. 

To show that (A. 3) holds for any nondecreasing h, let x w = sup{x : h(x) < 
w} for any given w. Then {x : h(x) < w} = Aw\x), where Aw\x) ={x:x < 
x w } if h(x w ) < w and (x) = {x:x < x w } if h(x w ) > w. This means that 
the probability calculations of event {U : h(U) < w} are the same as those for 
either {U:U < x w } or {U :U < x w }. This allows us to go from Fj >(u,v) to 
Ft(x u ,x v ), for which we already have proved the desired result. A technical 
complication here is that, depending on the continuity properties of h at x u 
and x v , one or two "<" operations in the definition of Ft(x u , x v ) = P(u!f < 

(2) 

%u,Urp < x v ) may need to be replaced by "<". However, as far as (A. 3) is 
concerned, such modifications are immaterial because they do not affect any 
part of (A)-(D). Or mathematically, we have, for any given (u,v), 

F^^v) = P{h{U? +l ) < u,h(U?li) < v) 

= E[A^(U? +1 ) n 4 ft) (4 2 ]i)] < E[Ai h \u^) n 4 h) (4 2) )] 

= P{h{U ( T ] ) < u,h(U^ 2) ) <v)= F ( t\u,v), 

where the middle inequality holds because (A. 3) holds with h(x) = x, possi- 
bly with the aforementioned modifications from "<" to 
of the bivariate CDF. 
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A. 3. Proof of Theorem 6. For any h 6 L 2 [0, 1], Stein (1987) showed that 
Theorem 6 is true for the original LHS, that is, when T = 1. Therefore, 
for any monotone h £ L 2 [0, 1], Theorem 6 is a consequence of Theorem 5, 
because the smallest possible value of Con (h(U^),h(U^)) is — l/(k — 1), 
as seen in (3.9). More specifically, for any T > 1, any possible limit of — (k — 
l)CoiT(h(U^),h(U^)), as k— >oo, must be bounded below and above by 
1, and hence can only be 1. For nonmonotone h £ L 2 [Q, 1], the proof turns 
out to be much more technical. It essentially requires going through Stein's 
(1987) original arguments but using the more complicated bivariate CDF 
via (A. 5); details are given in Craiu (2001). 



A. 4. Proof of Theorem 7. First we prove (4.6) when both h\ and /12 are 
identity functions. By (A. 5), 



(A.8) D(t,t + 1) 



(A.9) 



sup 

[ku]^[kv] 



F t {{ku},{kv}) F t ^({ku},{kv}) 



< sup 



k(k-l) 
F t (u,v) F t _i(u,v) 



jfe(fc-l) 
This allows us to conclude that 



(A.10) 



D{t,t + 1) < 



k(k-l) 

D(o,r 



k(k-i) 

D{t-l,t) 
fc(Jfc-l) ' 



[fc(ife-l)]*' 

where D(0,1) = swp UtV {\F^(u,v) - F^(u,v)\} = sup u>v {uv - F^ l \u,v)}. 
By (A. 6), D(0, 1) is the maximum value of the following three suprema: 

1 

ku(kv-l) - {ku}({kv} - 1) 



(i) sup uv 

[ku]=[kv]=0 



(ii) sup 

0<[fei;] = [feu] 



UV 



sup \ 

u] I 



fku(l 



k(k-l) 
v) - {ku}(l - {kv}) 



0<[kv] = [k 



(iii) sup 

[kv]^[ku] 



uv 



ku{kv — 1) 



k(k-l) 
= sup 

[kv]^[ku] 



< 



1 



k-1 



k(k-l) 

Putting all these facts together, we have 

t+m-l 

D(t,t + m)< ]T D(i,i + 1) 

i=t 



u(l 



k-1 



k-l 



t+m-l 



< E 



1 



i=t 



1 l-[k(k-l)]- m 

k i {k - 1Y+ 1 ~ k i (k - iy+i 1 - [k(k - 1)]- 1 
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which is less than k~( l ~ l \k — l)~( i+2 ) when k>2. The extension to nonde- 
creasing functions h\,hi is immediate by defining h~ 1 (u) = sup{s : h(s) < u} 
and proceeding in the same fashion as in Appendix A. 2. That is, for any non- 
decreasing functions h\ and hi, 

D hlM (t,t + m)<snp\F t ^(hi 1 {u),h2 1 (v))-F t (h^ 1 (u),h^ 1 (v))\ 

u,v 

< D(t,t + m), 

where the definition of Ft+ m (x u ,x v ) and Ft(x u ,x v ) may need to be modified 
as in Appendix A. 2. 

A. 5. Maximum reduction factor in the uniform case. To maximize (4.7) 
subject to 1 < [kc] < k — 1, let i c = [kc] and f c = {kc}. By symmetry, the 
maximum of (4.7) occurs at either i c = 1 or i c = k — 2. When i c = 1, (4.7) 
can be expressed as 

(A.ll) Sl f \f c )~- Hl ~ fc)fc 



(l + / c )(fc-l-/ c )- 

Straightforward differentiation then shows that the maximizer must sat- 
isfy (k — 3)/^ + 2{k — l)/ c — (k — 1) = 0, which only has one acceptable 
solution f c = -^==^3—=. Therefore the maximizer of (4.7) is c\ = ^(1 + 

^==^=^ ==) with the corresponding maximal value of Sjf given in (4.8). 
The maximum from i c = k — 2 is the same, with the maximizer c\ = ^(k — 
2+ V ^ 3T ^ 
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